Diagnostic signature for heart failure with preserved ejection fraction (HFpEF): a machine learning approach using multi-modality electronic health record data

Background: Heart failure with preserved ejection fraction (HFpEF) is thought to be highly prevalent yet remains underdiagnosed. Evidence-based treatments are available that increase quality of life and decrease hospitalization. We sought to develop a data-driven diagnostic model to predict from electronic health records (EHR) the likelihood of HFpEF among patients with unexplained dyspnea and preserved left ventricular EF.
Methods and results: The derivation cohort comprised patients with dyspnea and echocardiography results. Structured and unstructured data were extracted using an automated informatics pipeline. Patients were retrospectively diagnosed as HFpEF (cases), non-HF (control cohort I), or HF with reduced EF (HFrEF; control cohort II). The ability of clinical parameters and investigations to discriminate cases from controls was evaluated by extreme gradient boosting. A likelihood scoring system was developed and validated in a separate test cohort. The derivation cohort included 1585 consecutive patients: 133 cases of HFpEF (9%), 194 non-HF cases (Control cohort I) and 1258 HFrEF cases (Control cohort II). Two HFpEF diagnostic signatures were derived, comprising symptoms, diagnoses and investigation results. A final prediction model was generated based on the averaged likelihood scores from these two models. In a validation cohort consisting of 269 consecutive patients [with 66 HFpEF cases (24.5%)], the diagnostic power of detecting HFpEF had an AUROC of 90% (P < 0.001) and average precision of 74%.
Conclusion: This diagnostic signature enables discrimination of HFpEF from non-cardiac dyspnea or HFrEF from EHR and can assist in the diagnostic evaluation in patients with unexplained dyspnea. This approach will enable identification of HFpEF patients who may then benefit from new evidence-based therapies.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12872-022-03005-w.


Introduction
Farajidavar et al. BMC Cardiovascular Disorders (2022) 22:567
Heart failure with preserved ejection fraction (HFpEF) is a highly prevalent yet under-diagnosed clinical syndrome [1,2]. The hallmarks are the signs and symptoms of heart failure (HF) and a preserved left ventricular ejection fraction (LVEF). While the diagnosis of HFpEF is straightforward in acutely decompensated patients, stable euvolemic patients present a greater challenge [3]. Exertional dyspnea is non-specific and occurs in many other conditions. Specialist diagnostic tests, e.g. expert echocardiography for diastolic dysfunction or invasive cardiac catheterization to document raised LV filling pressures, may not be immediately available to the non-specialist. A recent study found that among more than 44,000 community-based patients likely to have HF, only 50% had a documented LVEF [4]. Furthermore, those eventually diagnosed as having HFpEF required many more pre-diagnosis investigations and consultations than HFrEF patients. From a patient perspective, a diagnosis of HFpEF confers a high degree of morbidity as well as mortality rates equivalent to many forms of cancer [5]. Rates of readmission to hospital are high [6] and are associated with adverse outcomes [7]. From a healthcare system perspective, HFpEF is associated with significant costs due to frequent hospitalisation, with the median length of stay up to 19 days [8].
Until recently, no effective therapies were available for HFpEF [9][10][11]; however, recent clinical trial evidence suggests that sodium-glucose co-transporter 2 (SGLT-2) inhibitors are effective at decreasing hospitalization while increasing quality of life [12]. The presence of effective therapies highlights the need to identify patients who may derive benefit.
In previous epidemiological studies, identification and extraction of HFpEF cases from Electronic Health Records (EHR) has typically relied on diagnostic codes, additional medical record abstraction, and/or adjudication based on various expert criteria, e.g. the European Society of Cardiology criteria [13]. The EHR is, however, increasingly amenable to rapid and automated extraction of multiple clinical parameters, including the use of advanced natural language processing (NLP) algorithms to identify clinical concepts recorded in the unstructured text [14][15][16].
The aim of this study was to extract and analyze multimodality data from the EHR using a machine learning approach to develop an automated prediction tool to identify patients likely to have HFpEF.

Derivation cohort
We performed a retrospective study using de-identified data of patients attending King's College Hospital NHS Foundation Trust (KCH) in London (UK) between 2000 and 2019. We focused on patients who had undergone echocardiography as part of their inpatient or outpatient evaluation. With this starting point, a number of different patient cohorts were derived based on the LVEF, confirmed or possible HF, symptoms of dyspnea, and NT-proBNP (or BNP) level (see Additional file 1: Sections I and II). We identified confirmed HFpEF cases and two control cohorts: those with no evidence of HF (non-HF, Control cohort I) and those with HFrEF (Control cohort II). HFpEF cases were defined as patients with a preserved LVEF ≥ 50% (with no evidence of LVEF < 50% at any stage), a confirmed diagnosis of HF based on ICD10 codes I50.0, I50.1 or I50.9, dyspnea, and a raised NT-proBNP or BNP level (according to age-specific thresholds), in accordance with ESC diagnostic criteria [13]. Non-HF control cohort I was defined as no recorded diagnosis of HF, no dyspnea, no raised BNP and normal LVEF. HFrEF control cohort II was defined as having a recorded diagnosis of HF and reduced LVEF (i.e. < 50%). Patients with valvular heart disease (ICD10 codes I05-I09 and I35) were excluded.
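As an illustration only, the cohort definitions above can be expressed as a small labelling function. The `Patient` record, its field names and the precomputed `bnp_raised` flag are assumptions made for this sketch (the age-specific NT-proBNP/BNP thresholds are not reproduced here), and treating "normal LVEF" as ≥ 50% is likewise an assumption; this is not the study's pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    # All field names are hypothetical, chosen for this sketch.
    min_lvef: float         # lowest LVEF (%) recorded at any stage
    hf_code: bool           # ICD-10 I50.0, I50.1 or I50.9 recorded
    dyspnea: bool           # dyspnea documented
    bnp_raised: bool        # above the age-specific threshold (assumed precomputed)
    valvular_disease: bool  # ICD-10 I05-I09 or I35 recorded


def assign_cohort(p: Patient) -> Optional[str]:
    """Return 'HFpEF', 'HFrEF', 'non-HF', or None if no definition is met."""
    if p.valvular_disease:
        return None  # excluded from the derivation cohort
    preserved = p.min_lvef >= 50  # no evidence of LVEF < 50% at any stage
    if p.hf_code and preserved and p.dyspnea and p.bnp_raised:
        return "HFpEF"
    if p.hf_code and not preserved:
        return "HFrEF"
    if not p.hf_code and preserved and not p.dyspnea and not p.bnp_raised:
        return "non-HF"
    return None  # lacks at least one feature required by every definition
```

Patients for whom the function returns `None` without being excluded for valvular disease correspond loosely to those lacking at least one diagnostic feature, which is the population from which the test cohorts were drawn.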

Test cohorts
We generated four test cohorts from patients who lacked at least one of the above diagnostic features for a confirmed diagnosis of HFpEF (see Additional file 1: Table S1 and Flowchart S1). We randomly sampled 100 patients from each of these four test cohorts for analysis and removed samples where the clinical annotations disagreed or there was more than 70% missingness in signature predictors, leaving 269 in total.

Data extraction and evaluation
Clinical and demographic data were retrieved from the structured and unstructured components of the EHR using the CogStack informatics platform [15]. Automated parsing of the EHR was achieved with a state-of-the-art enterprise search and well-validated natural language processing (NLP) tools, including MedCAT [16] and the Unified Medical Language System repository [17] as previously used by our group [18]. Clinical term extraction was restricted to concepts which represent clinical findings, diseases (apart from HF), medications, and signs and symptoms. This was linked to searches of structured data from an internal database containing echocardiographic data and ICD codes. Continuous variables were cleaned prior to cohort selection; e.g. conversion of text references of LVEF to numerical values and removal of measurement outliers (see Additional file 1: Section III). We used both platforms to arbitrate discrepancies in our derivation dataset as neither source proved to be comprehensive, in line with previous work [15,16].
Echocardiographic data were based on formal studies performed according to British Society of Echocardiography guidelines (which are consistent with American and European guidelines) [19,20]. In addition to collecting structured data from the echocardiographic dataset, we also collected numerical data that had been reported in the EHR text. For situations where a numerical value for LVEF had not been included in the echocardiogram report, we used a deep learning model to infer whether the LVEF was preserved based on written summary text of the echocardiogram report (see Additional file 1: Section III). BNP or NT-proBNP results were obtained from samples drawn at any time in the study period and the maximum value for each subject was used.
All cases in the derivation dataset that were identified by the data pipeline as HFpEF were validated by manual review of the EHR by a cardiologist.

Potential modeling predictors
A binary diagnostic outcome indicating the presence or absence of HFpEF was considered for modeling. Potential predictors to be included in a diagnostic signature included those used in previous HFpEF epidemiological studies [21,22]. In addition, we adopted a comprehensive approach that included physiological variables, laboratory results, echocardiographic data and clinical concept references [23]. Structured data were collected within a two-month window around the last echocardiography result (or NT-proBNP/BNP test result if available). Unstructured data were analyzed from the entire EHR prior to the date of the echocardiography result for each patient.
We made a second-level predictor grouping according to whether the variables were initially recorded as (a) structured data: demographic and physiological parameters, and laboratory and echocardiography measurements; or (b) unstructured text in the EHR, extracted via the NLP platform. We adopted the bag-of-words [24] approach to transform clinical concept annotations into word vectors for modeling purposes. Concepts which were mentioned in < 10% of the derivation cohort were excluded. Data from the other predictor categories were collected and imputed prior to training, using k-nearest neighbor imputation (Scikit-learn python package v0.22) after min-max normalization. Following imputation, data items were rescaled into their original range to preserve the explainability of the final model.
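The normalize, impute and rescale sequence described above can be sketched in plain Python. This is a simplified stand-in written for illustration, not the scikit-learn imputer used in the study: the function names, the choice of `k`, and the distance (root mean squared difference over the columns observed in both rows) are all assumptions of this sketch.

```python
import math


def min_max_fit(rows):
    """Per-column (min, max) over observed (non-None) values."""
    bounds = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        bounds.append((min(observed), max(observed)))
    return bounds


def scale(rows, bounds):
    """Map observed values into [0, 1]; missing values stay None."""
    return [[None if v is None else ((v - lo) / (hi - lo) if hi > lo else 0.0)
             for v, (lo, hi) in zip(row, bounds)] for row in rows]


def unscale(rows, bounds):
    """Map values from [0, 1] back to the original range of each column."""
    return [[v * (hi - lo) + lo for v, (lo, hi) in zip(row, bounds)]
            for row in rows]


def distance(a, b):
    """Root mean squared difference over columns observed in both rows."""
    shared = [(x - y) ** 2 for x, y in zip(a, b)
              if x is not None and y is not None]
    return math.sqrt(sum(shared) / len(shared)) if shared else math.inf


def knn_impute(rows, k=2):
    """Fill each missing value with the mean of its k nearest donors."""
    bounds = min_max_fit(rows)
    scaled = scale(rows, bounds)
    filled = [list(row) for row in scaled]
    for i, row in enumerate(scaled):
        for j, value in enumerate(row):
            if value is None:
                # Donors are other rows in which column j was observed.
                donors = sorted((distance(row, other), other[j])
                                for m, other in enumerate(scaled)
                                if m != i and other[j] is not None)[:k]
                filled[i][j] = sum(v for _, v in donors) / len(donors)
    return unscale(filled, bounds)
```

Normalizing before imputation keeps variables with large numeric ranges from dominating the neighbor search, and rescaling afterwards returns every predictor to its clinical units, which is what keeps the final model interpretable.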

Data modeling, feature selection and validation
We used the tree-based multivariable extreme gradient boosting [25] algorithm (XGBoost, python package v0.9) for modeling, enabling inclusion of mixed data types and smooth handling of missing values and sparsity issues. As such, when a value is missing in the sparse predictor vector, the instance is classified into a default direction (see [25] for further details) that is learnt as optimal using derivation data.
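To make the "default direction" idea concrete, the sketch below evaluates a single candidate split and sends the missing values to whichever side yields the larger gain. It is a simplified illustration modelled on the sparsity-aware split finding described in [25], not XGBoost's actual code: it assumes logistic loss with a constant starting probability, omits the complexity penalty, and all function names are hypothetical.

```python
def leaf_score(grad_sum, hess_sum, lam):
    """Structure score of a leaf from summed gradients and Hessians."""
    return grad_sum * grad_sum / (hess_sum + lam)


def split_gain(left, right, lam=1.0):
    """Gain of a split given (gradient sum, Hessian sum) for each side."""
    (gl, hl), (gr, hr) = left, right
    return 0.5 * (leaf_score(gl, hl, lam) + leaf_score(gr, hr, lam)
                  - leaf_score(gl + gr, hl + hr, lam))


def learn_default_direction(values, labels, threshold, base_prob=0.5, lam=1.0):
    """Pick the side ('left' or 'right') that missing values should follow."""
    sums = {"left": [0.0, 0.0], "right": [0.0, 0.0], "missing": [0.0, 0.0]}
    for value, label in zip(values, labels):
        grad = base_prob - label               # logistic-loss gradient
        hess = base_prob * (1.0 - base_prob)   # logistic-loss Hessian
        if value is None:
            side = "missing"
        elif value < threshold:
            side = "left"
        else:
            side = "right"
        sums[side][0] += grad
        sums[side][1] += hess
    (gl, hl), (gr, hr), (gm, hm) = sums["left"], sums["right"], sums["missing"]
    gains = {
        "left": split_gain((gl + gm, hl + hm), (gr, hr), lam),
        "right": split_gain((gl, hl), (gr + gm, hr + hm), lam),
    }
    return max(gains, key=gains.get), gains
```

In this toy setting, if patients with a missing measurement mostly share the outcome of patients on the right of the threshold, the learned default direction is "right", so new patients lacking that measurement are routed with them.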
SHAP [26] analysis (SHapley Additive exPlanations; SHAP python package v0.33) was used to order the predictors according to their prominence in discriminating cases from controls. Once the full model was created, we took a stepwise forward insertion scheme to include the more significant variables one at a time, in order to determine the minimal number of predictors that gave an acceptable performance relative to the use of all predictors. The final predictive models were trained and evaluated using the obtained optimal subset of predictors.
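The stepwise forward insertion can be sketched as a loop over the importance-ranked predictors. The `evaluate` callback (for example, a cross-validated AUROC for a given predictor subset) and the `tolerance` that defines "acceptable performance" are assumptions of this sketch; the text does not state the exact stopping criterion.

```python
def minimal_feature_subset(ranked_features, evaluate, tolerance=0.01):
    """Smallest prefix of the ranked list scoring within `tolerance` of the full set.

    ranked_features: predictors ordered from most to least important.
    evaluate: hypothetical callable mapping a list of predictors to a score
              where higher is better.
    """
    full_score = evaluate(list(ranked_features))
    for n in range(1, len(ranked_features) + 1):
        subset = list(ranked_features[:n])
        if evaluate(subset) >= full_score - tolerance:
            return subset
    return list(ranked_features)
```

Because predictors are added in order of importance, the first subset that reaches the target is also the smallest one considered, which matches the stated aim of finding the minimal number of predictors.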
Model validation was undertaken in the test cohorts described earlier, using clinical assessment criteria from the H2FPEF score [3] as a comparator. A random sample of 400 patients from the test datasets was manually reviewed by two teams each comprising two cardiologists, in order to validate diagnoses. Any cases of clinician disagreement were removed from the evaluation, leaving a total of 269 patients in the test datasets (see Results, Table 1).

Statistical analysis of predictors
Data are presented as mean and standard deviation (SD) or median and interquartile range (IQR) as appropriate. Differences between cases and controls were evaluated by the Mann-Whitney U test or unpaired t test, as appropriate. The area under the receiver-operating characteristic curve (AUROC), F1-score (macro and weighted) and average precision (AP) were used as performance metrics. The F1 score measures the performance of a classifier as the harmonic mean of precision (true positives as a proportion of all positive predictions) and recall (proportion of all positives correctly identified by the model), placing equal importance on both. Average precision is the weighted mean of precision scores obtained as the classification threshold is adjusted (therefore changing the model recall), with the change in recall used as the weight.
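The two definitions above translate directly into code. The following is a minimal sketch for binary labels (1 = HFpEF, 0 = control); the function names are ours, and the average precision routine assumes that no two patients share exactly the same score.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_precision(y_true, scores):
    """Sum of precision at each threshold, weighted by the increase in recall."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(y_true)
    tp = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            # Recall rises by 1/total_pos each time a positive is reached.
            ap += (tp / rank) * (1 / total_pos)
    return ap
```

For example, with labels `[1, 0, 1, 0]` and scores `[0.9, 0.8, 0.7, 0.1]`, the two positives are reached at ranks 1 and 3, giving precisions of 1 and 2/3, each weighted by a recall step of 1/2.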
A stratified fivefold cross-validation scheme (to ensure each fold is a good representative of the whole data in terms of class prevalence) was utilized for feature selection and derivation set validation. As such, the derivation data was divided into five subsets, four of which were used for training the model and the final one for validation/testing. The training and validation subsets were rotated until all five subsets had been evaluated. The final performance was then reported as mean and standard deviation of all 5 tests (see Fig. 1).
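A stratified split of this kind can be sketched with the standard library alone. The function names and the fixed random seed are assumptions of this illustration; the study itself does not specify how the folds were generated.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds, preserving class prevalence."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def train_validation_splits(labels, k=5, seed=0):
    """Yield (train_indices, validation_indices), holding out each fold once."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]
```

Each sample is used for validation exactly once and for training in the other four rounds, so the mean and standard deviation over the five rounds summarize performance on the whole derivation set.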
The Kappa statistic was used to measure the agreement between the predictions of the proposed models. All tests were 2-sided, with P < 0.05 considered significant.
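For reference, Cohen's kappa compares the observed agreement between two sets of predictions with the agreement expected by chance. The sketch below is a generic implementation for categorical predictions; the function name is ours.

```python
def cohen_kappa(pred_a, pred_b):
    """Cohen's kappa between two equally long lists of categorical predictions."""
    n = len(pred_a)
    observed = sum(1 for a, b in zip(pred_a, pred_b) if a == b) / n
    categories = set(pred_a) | set(pred_b)
    # Chance agreement from the marginal frequency of each category.
    expected = sum((pred_a.count(c) / n) * (pred_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:
        return 1.0  # both raters always give the same single category
    return (observed - expected) / (1.0 - expected)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.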
To evaluate the generalizability of the model to a new sample, Harrell optimism was calculated with 1000 boot-strap replicates [27]. To evaluate discrimination power of the proposed model beyond existing criteria,

Results
A total of 1854 patients were included in the study, of whom 1585 were in the derivation cohort (Table 1). HFpEF patients in the derivation cohort (n = 133) were older than those with HFrEF and those without heart failure (non-HF), with a higher proportion of females and a higher BMI. They also had a higher prevalence of hypertension, atrial fibrillation, diabetes and chronic kidney disease. Systolic and diastolic pressures were higher in the HFpEF group compared to HFrEF. Patients with HFpEF had lower end-diastolic and end-systolic volumes and higher septal E/e' ratios than the non-HF control group.

Diagnostic signatures for HFpEF diagnosis
Our first step was to determine model performance in predicting non-HF versus HFpEF and HFpEF versus HFrEF, and to identify the most useful features in each case.
The minimum number of features required to distinguish HFpEF from non-HF was 30, while the minimum number required to distinguish HFpEF from HFrEF was 29. These features and their relative importance in discriminating HFpEF from non-HF and HFrEF are shown in Fig. 2. Dyspnea and 'pharmacologic substance' were the most prominent predictors in discrimination against non-HF whereas LVEF was most important for discrimination against HFrEF. However, many of the features (e.g. age, patient address) were common to the two signatures. The text references to "patient address" and "pharmacologic substance" (detected when the text refers to medication) were interpreted as surrogate predictors of the number of complete hospital attendances.
We found that a combined model using both structured and unstructured data performs better than a model using either structured or unstructured data alone (Table 2). This improvement is more pronounced when discriminating HFpEF from HFrEF than when discriminating HFpEF from non-HF (due to the dominance of unstructured predictors in the HFpEF v non-HF model, see Fig. 2 and Table 3).

Selection of the final model and evaluation in test cohorts
The final model that was used for test evaluations aggregates the HFpEF versus HFrEF and HFpEF versus non-HF signature likelihood predictions through an averaging operation. It therefore uses all features from both component models (Table 4). In the final "aggregated" model, a patient is predicted to have HFpEF if the average predicted probability of HFpEF versus non-HF and versus HFrEF is ≥ 0.5. The idea of the aggregated model is to aid discrimination between HFpEF and related conditions. We used this aggregate model to make predictions on the test sets. Additional file 1: Figure S5 summarises the entire processing and model training pipeline, while Additional file 1: Figure S6 gives details of model adaptation [28].
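The aggregation rule amounts to a few lines of code. In this sketch the two inputs stand for the probabilities returned by the component models; the function and parameter names are hypothetical.

```python
def aggregate_hfpef_prediction(p_vs_non_hf, p_vs_hfref, threshold=0.5):
    """Average the two component probabilities and apply the decision threshold.

    p_vs_non_hf: predicted probability from the HFpEF versus non-HF model.
    p_vs_hfref:  predicted probability from the HFpEF versus HFrEF model.
    Returns (mean probability, True if HFpEF is predicted).
    """
    mean_prob = (p_vs_non_hf + p_vs_hfref) / 2
    return mean_prob, mean_prob >= threshold
```

A patient scored 0.89 by the HFpEF versus non-HF model but only 0.31 by the HFpEF versus HFrEF model would receive a mean of about 0.60 and still be flagged, whereas two scores of 0.40 and 0.50 would average 0.45 and fall below the threshold.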
The performance of both proposed base models and the final aggregated model remained robust in the test cohort as compared to expert clinical consensus, with an   (Fig. 3). Lastly, we compared the final aggregate model as well as the baseline HFpEF versus non-HF and HFpEF versus HFrEF models with the recently described H2FPEF model. The AUROC and average precision of both the aggregate model and the individual baseline models were higher than those of the H2FPEF model (Table 5). We additionally used Cohen's kappa score to report on the agreement between the predictions made by our proposed baseline HFpEF versus non-HF and HFpEF versus HFrEF models, to better highlight the advantage of the aggregate model over the individual base models in discriminating HFpEF from non-HF and HFrEF. The positive kappa score of 0.3 indicates weak agreement between the two base models (i.e. they can make different predictions for whether HFpEF is present in the same patient). This was expected as the test cohort had lower availability of clinical assessments compared to the derivation cohort. Together with the improved overall performance, this result supports the use of the aggregated model.

Discussion
In this study, we have developed an automated pipeline for EHR-based data collection, processing and modeling to identify patients with a high likelihood of HFpEF. We incorporated multi-modality data, including both structured and unstructured predictors, to generate a disease diagnostic signature. The proposed signature was validated in a separate cohort of patients and performed favourably as compared either to expert clinical consensus or the recently proposed H2FPEF score [3].
Analysis of the signatures that distinguished HFpEF from non-cardiac causes of dyspnea (non-HF) revealed anticipated predictors such as atrial fibrillation, hypertension, diabetes mellitus, kidney failure and obesity, in accordance with previous literature [3]. In addition, surrogate measures of multiple previous clinical encounters detected by the NLP algorithm as frequent text references to terms such as "pharmacologic substance" (a reference to drug treatment but not a specific medication) or "patient address" were very useful. This may reflect the fact that patients with HFpEF may require multiple clinical visits and investigations, often with different specialities, before a diagnosis is established [4]. Apart from LVEF itself, features that distinguished HFpEF from HFrEF included age, peripheral edema, and other echocardiographic measures. An advantage of the approach that we employed may be that it is unbiased and comprehensive and identifies variables for inclusion in the diagnostic signature based purely on the results of the objective feature selection process. This may be one reason why our algorithm outperforms the H2FPEF score, which is based on the evaluation of selected variables rather than a comprehensive unbiased analysis. In this regard, it is of interest that echocardiographic predictors that contributed to the differentiation of HFpEF from HFrEF included maximum flow velocity across the aortic valve, aortic insufficiency and LA volume whereas E/e' (which is part of the H2FPEF score) did not feature in the selected predictors. Indeed, we note that several indices from a standard echocardiographic dataset that are typically used to identify HFpEF do not feature as predictors differentiating HFpEF from HFrEF. These include LV cavity dimensions; LV wall thickness and mass; and E/e' as mentioned above. However, given the defining features of HFpEF versus HFrEF, it is perhaps not surprising that the top differentiating features are variations of quantifying LVEF.
A major underlying problem in efforts to develop or test new treatments for HFpEF is the difficulty in consistently diagnosing the syndrome [4]. Many different approaches are used in the literature based on varying criteria published by national and international societies, and diverse inclusion criteria have been used in clinical trials [29][30][31]. The problem is compounded by the likelihood that HFpEF is a heterogeneous syndrome in which sub-populations may have differing underlying pathophysiology and outcomes [21,22,29]. The approach we present enables rapid identification of likely HFpEF cases among which further specific phenotyping could be performed to refine the diagnosis and potentially test or target defined interventions, or to identify potential subjects for research studies. In practice, the output of each of our models is a predicted probability in the range 0-1; for example, the HFpEF vs non-HF model could return 0.89, indicating a predicted 89% probability of HFpEF. Importantly, this approach aims to identify both compensated and decompensated HFpEF cases, using an automated and data-driven approach that is effective even where structured data (e.g. NT-proBNP measurements) are scarce. The approach may be considered complementary to scores such as H2FPEF. Our signature is ideally suited to rapidly identify a large number of possible HFpEF cases from EHR whereas H2FPEF is better suited for use by the clinician evaluating an individual patient who is suspected to have HFpEF. This study is the first to use SHAP analysis for feature selection in this context. We comprehensively validated all variations of the derived models in multiple datasets with underlying variational distributions. We demonstrated a significant improvement in HFpEF diagnostic performance when discriminating the patients with HFpEF from those with HFrEF or no HF history.
A key strength of our approach is that modeling numerical assessment data (structured results signature) and EHR concept references separately makes the models applicable in scenarios where one of these sources of data may be scarce. Moreover, the dual modeling of HFpEF separation from non-HF and HFrEF subjects increases the utility of the proposed pipeline in distinguishing among a wider group of clinical conditions.

Limitations
The UMLS clinical concept encoding that was used to extract unstructured observations does not support distinct encoding of different disease stages and could therefore cause some inaccuracy. More generally, the a priori assumptions that we made to identify definite HFpEF cases in the derivation dataset influenced the characterisation of the cohort. For example, we utilised ICD-10 diagnostic codes in the identification of patients with heart failure. Previous studies have demonstrated inaccuracy in identifying incident heart failure using ICD-10 coding as the sole source [32]. It is possible that such inaccuracy is present in our coding system; however, the use of additional features (symptoms, LVEF, BNP/NT-proBNP) in case classification mitigates this risk in our study. Similarly, it is possible that for some patients an HF diagnosis is known but not recorded in the records we accessed, or was recorded but not detected by our NLP algorithm (i.e. a false negative). As we combined a number of other features, including symptoms and blood tests, in assigning our final HF diagnosis labels, we expect the overall impact on the results to be minimal.
The inclusion of a raised BNP criterion restricts the cohort to a subgroup of HFpEF subjects (a proportion of HFpEF patients have a normal BNP), which was evident in test cohorts where many of the subjects did not have BNP measurements. This issue could be successfully handled through transfer learning techniques but would require some labelled data from a new domain to facilitate such a feedback training loop. The choice of data imputation technique could be another source of minor but systematic error. The discriminant power of the model to detect HFpEF is lower in test subsets where the missing data rate is higher and HFpEF cases are a small proportion of the overall number. Finally, the applicability of our model in patients with HFpEF who have never required hospital evaluation or admission is unknown. However, a strength of our approach is that a dedicated specialist assessment for HF is not required to assess the probability of HFpEF among patients undergoing general hospital evaluation (e.g. non-cardiological), even in the absence of commonly used diagnostic data such as NT-proBNP levels. The lack of independent validation is a limitation of this study. Evaluation of the derived model's performance in independent datasets from other centres and in community-based datasets will be informative in future studies. Although we compared performance of the model with the H2FPEF score [3], due to its stated aim of estimating the likelihood of HFpEF among patients with unexplained dyspnea to guide further testing, we did not compare performance to the HFA-PEFF algorithm, which is a multi-step diagnostic algorithm [33]. Furthermore, the comparison of our algorithm's performance with the H2FPEF score should be confirmed in a separate validation cohort. The HFrEF group (Control Cohort II) comprised patients with a diagnosis of HF and reduced LVEF on echocardiogram using a cut-off value of < 50%. As such, this cohort combines HFrEF and HFmrEF cases as described in ESC guidelines [13]. Finally, in our analysis we focus on performance at the group level. Future work should establish the applicability of this method on an individual level, such as focusing on older or younger patients.

Conclusion
In this study, we have developed a rapid and automated data-driven approach that is effective at identifying patients from EHR who are likely to have HFpEF. This algorithm affords significant potential to rapidly identify patients for more detailed analyses and access to evidence-based therapies that are known to improve quality of life and decrease rates of hospitalisation. The approach that we report could in principle be readily applied to other diseases and conditions that are similarly difficult to diagnose.